An Efficient Image Based Approach for Extraction of Deep Web Data

نویسنده

  • S. SREENIVASA
چکیده

The Internet presents a huge amount of useful information which is usually formatted for its users, which makes it difficult to extract relevant data from various sources. Deep Web contents are extracted by submitting the queries to semi structured Web databases and the returned data records are enwrapped in dynamically generated Web pages. Extracting structured data from deep Web pages is a challenging problem due to the underlying intricate semi structures of such pages. Until now, a large number of techniques have been proposed to address this problem, but all of them have inherent limitations because they are Web-page-programming-language dependent. As the popular two-dimensional media, the contents on Web pages are always displayed regularly for users to browse. This motivates us to seek a different way for deep Web data extraction to overcome the limitations of previous works by utilizing some interesting common visual features on the deep Web pages. In this paper, an image based approach that is Webpage programminglanguage-independent is proposed using XML like structure and the help of VIPS algorithm. This approach primarily focuses the visual features on the hidden Web pages to implement deep Web data extraction, including data record extraction and data item extraction. We also propose a new evaluation measure revision to capture the amount of human effort needed to produce perfect extraction. Our experiments on a large set of Web databases show that the proposed vision-based approach is highly effective for deep Web data extraction.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

High Fuzzy Utility Based Frequent Patterns Mining Approach for Mobile Web Services Sequences

Nowadays high fuzzy utility based pattern mining is an emerging topic in data mining. It refers to discover all patterns having a high utility meeting a user-specified minimum high utility threshold. It comprises extracting patterns which are highly accessed in mobile web service sequences. Different from the traditional fuzzy approach, high fuzzy utility mining considers not only counts of mob...

متن کامل

Digital surface model extraction with high details using single high resolution satellite image and SRTM global DEM based on deep learning

The digital surface model (DSM) is an important product in the field of photogrammetry and remote sensing and has variety of applications in this field. Existed techniques require more than one image for DSM extraction and in this paper it is tried to investigate and analyze the probability of DSM extraction from a single satellite image. In this regard, an algorithm based on deep convolutional...

متن کامل

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. In our extraction algorithm, we inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

متن کامل

Parallelization of Rich Models for Steganalysis of Digital Images using a CUDA-based Approach

There are several different methods to make an efficient strategy for steganalysis of digital images. A very powerful method in this area is rich model consisting of a large number of diverse sub-models in both spatial and transform domain that should be utilized. However, the extraction of a various types of features from an image is so time consuming in some steps, especially for training pha...

متن کامل

A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model

Information extraction (IE) is a process of automatically providing a structured representation from an unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) which has been intensified by the increased volume of information and heterogeneity, and non-structured form of it. One of the core information extraction tasks is relation extraction wh...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011